Packages

set.seed(0911)
library(ggplot2)
library(gridExtra)
library(cowplot)
library(plotly) # interactive plots
library(ggfortify) # diagnostic plots
library(forestmodel) # plot odds ratios
library(arm) # binnedplot diagnostic plot for GLMs


library(knitr)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(broom) # function augment() adds columns to the original data that was modeled
library(effects) # plot effect of covariate/factor
library(questionr) # odds ratios


library(lmtest) # LRtest
library(survey) # Wald test
library(vcdExtra) # deviance test





library(rsample)   # for data splitting
library(glmnet)
library(nnet) # multinom, glm
library(caret)
library(ROCR)
#library(PRROC) # another package for ROC and PR curves
library(ISLR) # dataset for statistical learning

 







ggplot2::theme_set(ggplot2::theme_light()) # set the graphical theme

1. Introduction

The optimal prediction of \(Y\) conditionally on \(x\) is the regression function \(h(x)=\text{E}[Y\vert x]\). In the previous chapter, we assumed that \(h(x)\) is linear with respect to \(x\): \[h(x)=\text{E}[Y\vert x]=x^T\beta, \quad\text{where}\quad Y=x^T\beta+\epsilon \quad\text{with}\quad \epsilon\sim\mathcal{N}(0,\sigma^2).\]

😕 This linear assumption, however, limits us to continuous responses and does not handle categorical outcomes, thus restricting our ability to address classification tasks.

To broaden our approach, we introduce models that maintain a linear predictor \(\eta(x)=x^T\beta\) but incorporate a link function \(g\) to map the conditional expectation of \(Y\) to the linear predictor. Specifically, we assume \[ g(\text{E}_\beta[Y\vert x])=x^T\beta, \] where \(g(\cdot)\) is known as the link function and is assumed invertible, with inverse \(h=g^{-1}\) mapping the linear predictor back to the mean. Consequently, we can express the conditional expectation as \[ \text{E}[Y\vert x]=g^{-1}(\eta(x))=g^{-1}(x^T\beta). \] This formulation, combining a linear predictor with a link function, allows for greater flexibility in modeling responses of various types, including binary and categorical outcomes.

1.1. Natural exponential family


[Definition 1]
A random variable \(Y\) is said to have a probability density with respect to a dominant measure \(\nu\), denoted by \(f_{\theta,\phi}\), and to belong to the natural exponential family \(\mathcal{F}_\theta^{Nat}\) if \(f_{\theta,\phi}\) can be expressed as \[f_{\theta,\phi}(y)=\exp\left(\frac{y\theta-b(\theta)}{\phi}+c(y,\phi)\right), \] where \(b(\cdot)\) and \(c(\cdot)\) are known and differentiable functions, satisfying the following properties:

  • \(b(\cdot)\) is three times differentiable,
  • \(b'(\cdot)\) is invertible, \(i.e.\) \((b')^{-1}(\cdot)\) exists,
  • \(\theta\in\Theta\subseteq\mathbb{R}\) is the natural parameter,
  • \(\phi\in\mathcal{B}\subseteq\mathbb{R}^+_*\) is the dispersion parameter.




[Proposition 1]
If \(\,Y\) admits a density belonging to the natural exponential family \(\mathcal{F}_\theta^{Nat}\) then

  • The expected value of \(Y\) is \(\text{E}_\theta[Y]=b'(\theta).\)
  • The variance of \(Y\) is \(\text{Var}_\theta[Y]=b''(\theta)\phi.\)
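As a worked example of Definition 1 and Proposition 1, consider \(Y\sim\mathcal{B}(p)\). Writing \[f(y)=p^y(1-p)^{1-y}=\exp\left(y\log\frac{p}{1-p}+\log(1-p)\right)\] identifies \(\theta=\log\frac{p}{1-p}\), \(b(\theta)=\log(1+e^\theta)\), \(\phi=1\) and \(c(y,\phi)=0\), and therefore \[\text{E}_\theta[Y]=b'(\theta)=\frac{e^\theta}{1+e^\theta}=p \quad\text{and}\quad \text{Var}_\theta[Y]=b''(\theta)\phi=\frac{e^\theta}{(1+e^\theta)^2}=p(1-p),\] which are indeed the mean and variance of a Bernoulli variable.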



METHOD

1.\(\,\) Select the conditional distribution of \(Y\vert x\). Choose a probability distribution for \(Y\vert x\) from the natural exponential family.

2.\(\,\) Define the linear predictor. Set \(\eta(x):=x^T\beta\) and select an appropriate link function. Often, the canonical link function is used for simplicity and interpretability.

3.\(\,\) Estimate the parameter \(\beta\). Obtain an estimate \(\widehat{\beta}_n\) of the unknown parameter \(\beta\) using a sample \(\mathcal{D}_n\{(Y_i,x_i)\}_{i=1,\cdots,n}\). Then, the estimated mean response is given by: \[g^{-1}(X\widehat{\beta}_n)\quad\text{where}\quad X=(x_1,\cdots,x_n)^T.\]


1.2. Maximum likelihood estimator (MLE)

Let \(Y=(Y_1,\cdots,Y_n)^\top\) represent the response vector, and define the design matrix as: \[X = \begin{pmatrix} x_{11}&\cdots &x_{1p} \\ \vdots& &\vdots\\ x_{n1}&\cdots &x_{np} \end{pmatrix}= (X_1,\cdots,X_p)=\begin{pmatrix} x_{1}^T\\ \vdots\\ x_{n}^T \end{pmatrix},\] where each \(X_j\), \(j=1\cdots,p\) represents one of the explanatory variables.


Let \(\mathcal{L}(\beta)\) denote the log of the likelihood function. Given that the \(Y_i\) are independent, we have: \[\mathcal{L}(\beta)=\sum_{i=1}^n\log f_{\theta_i,\phi}(Y_i)=\sum_{i=1}^n\mathcal{L}_i(\beta), \] where \(\mathcal{L}_i(\beta)\) is the contribution of the \(i^\text{th}\) observation \((Y_i,x_i)\) to the log of the likelihood: \[\mathcal{L}_i(\beta)=\ell(Y_i,\theta_i,\phi,\beta)=\log f_{\theta_i,\phi}(Y_i)=\frac{Y_i\theta_i-b(\theta_i)}{\phi}+c(Y_i,\phi).\]



[Proposition 2]
The likelihood equations are given by \[\frac{\partial\mathcal{L}(\beta)}{\partial\beta_j}=\sum_{i=1}^n\frac{Y_i-\mu_i}{ \text{Var}[Y_i]}h'(\eta_i) x_{i,j}=0,\quad j=1,\cdots,p \] In matrix form, the gradient can be expressed as: \[ \nabla\mathcal{L}(\beta)=\left[\frac{\partial\mathcal{L}(\beta)}{\partial\beta_1},\cdots,\frac{\partial\mathcal{L}(\beta)}{\partial\beta_p} \right]^T=\boldsymbol{0}_p. \]
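These equations can be checked numerically: with the canonical link, \(h'(\eta_i)\) cancels against \(\text{Var}[Y_i]\), so the gradient reduces to \(X^T(y-\mu)\), which must vanish at the ML estimate. A sketch on the built-in mtcars data (used here only so the check is reproducible):

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
X <- model.matrix(fit)             # design matrix
mu <- fitted(fit)                  # estimated means mu_i
grad <- t(X) %*% (mtcars$am - mu)  # likelihood equations (canonical link)
max(abs(grad))                     # numerically zero at the MLE
```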

1.4. Logistic vs. linear regression: an example

To illustrate the differences between logistic and linear regression, consider a binary response variable \(Y\), where \(Y\) takes values in \(\{0,1\}\). In this context, \(Y\) represents whether a payment default occurs.

ISLR::Default %>%rmarkdown::paged_table()

Linear Regression Model We begin with a simple linear regression model to predict the probability of default.

The following code visualizes a linear regression fit for predicting \(Y\) as a function of balance:

p1 <- ISLR::Default %>%
  mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
  ggplot(aes(balance, prob)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "lm") +
  ggtitle("Linear regression model fit") +
  xlab("Balance") +
  ylab("Probability of Default")
ggplotly(p1)

📈 The plot suggests that the linear model may not be suitable for this binary outcome, as it can predict probabilities outside the \([0,1]\) range. Therefore, selecting an appropriate distribution for \(Y\vert x\) is crucial; a Bernoulli distribution for \(Y\) conditioned on \(x\) is more appropriate in this case, with: \[ p(x)=P(Y=1\vert x)\text{ and } \mu(x)=\text{E}[Y\vert x]=p(x). \]


Logistic Regression Model To correctly model the relationship, we choose the canonical logit link function: \[g(\mu(x))=g(p(x))=\text{logit}(p(x))=\log\left(\frac{p(x)}{1-p(x)}\right).\] With \(\eta(x)=x^T\beta\) and \(\widehat{\beta}_n\) as an estimator of \(\beta\) from \(n\) observations, we estimate \(\text{E}[Y\vert x]=p(x)\) using: \[\widehat{p}(x)=g^{-1}(\widehat{\eta}(x))=g^{-1}(x^T\widehat{\beta}_n)=\frac{e^{x^T\widehat{\beta}_n}}{1+e^{x^T\widehat{\beta}_n}}. \]
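In R, the logit link and its inverse are available as qlogis() and plogis(); a quick numerical check of the formulas above:

```r
p <- 0.25
eta <- qlogis(p)  # g(p) = log(p / (1 - p))
plogis(eta)       # g^{-1}(eta) = exp(eta) / (1 + exp(eta)), recovers p
```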

The code below compares the logistic model fit:

p2 <- ISLR::Default %>%
  mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
  ggplot(aes(balance, prob)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  ggtitle("Logistic regression model fit") +
  xlab("Balance") +
  ylab("Probability of Default")

subplot(ggplotly(p1), ggplotly(p2), nrows = 1)

📈 A typical classification rule is to predict \(\widehat Y_i=1\) if \(\widehat{p}_i=\widehat{p}(x_i)>s\), with \(s=0.5\) commonly used as the threshold.
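In code, this rule amounts to thresholding the fitted probabilities (the probabilities below are made up for illustration):

```r
p_hat <- c(0.12, 0.81, 0.47, 0.63)  # hypothetical fitted probabilities
s <- 0.5                            # decision threshold
y_pred <- ifelse(p_hat > s, 1, 0)   # predicted classes: 0 1 0 1
```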

2. Logistic regression

To begin with, let us recall that for the canonical link, the likelihood equations simplify to: \[\sum_{i=1}^n\frac{(Y_i-\mu_i)x_{i,j}}{\phi} =0,\quad j=1,\cdots,p. \] Next, we consider a logistic regression model where \[Y_i\vert x_i \sim\mathcal{B}(p_i)\] with \(\mu_i=p_i=\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\) and \(\phi=1\). Consequently, the likelihood equations can be expressed as: \[\sum_{i=1}^n\left(Y_i-\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\right)x_{i,j} =0,\quad\forall j=1,\cdots,p.\]


📝 Remarks.
  • It is important to note that there is generally no closed-form solution for these equations.
  • Instead, efficient approximation algorithms, such as the Newton-Raphson algorithm, are commonly employed to find estimates.
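To make the remark concrete, here is a minimal Newton-Raphson (equivalently, Fisher scoring) sketch for the logistic likelihood equations, checked against glm() on the built-in mtcars data; the helper logit_newton is our own, not a library function:

```r
logit_newton <- function(X, y, tol = 1e-10, maxit = 50) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    p <- as.vector(plogis(X %*% beta))  # current fitted probabilities
    W <- p * (1 - p)                    # variances Var[Y_i]
    grad <- t(X) %*% (y - p)            # score vector (likelihood equations)
    H <- t(X) %*% (X * W)               # Fisher information matrix
    step <- solve(H, grad)              # Newton step
    beta <- beta + as.vector(step)
    if (max(abs(step)) < tol) break
  }
  beta
}
X <- model.matrix(~ wt, data = mtcars)
b_nr <- logit_newton(X, mtcars$am)
b_glm <- coef(glm(am ~ wt, data = mtcars, family = binomial))
max(abs(b_nr - b_glm))  # essentially zero
```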

2.1. Uploading the dataset

For this chapter, we will use the binary.csv dataset, which contains information regarding the admissions of students to a new establishment.

The dataset includes:

  • The response variable (admit), which is a factor with two levels: 0 (not admitted) and 1 (admitted).
  • Two covariates to explain the admission: the Graduate Record Examination score (gre), and the Grade Point Average (gpa).
  • The establishment’s rank, represented by the factor rank, which has four levels (1, 2, 3 and 4). Note that a rank 1 establishment is more prestigious than a rank 2 establishment.

To load and view the dataset, use the following code:

mydata<-read.csv("binary.csv",header=T,sep=",")
mydata$admit<-factor(mydata$admit)
mydata$rank<-factor(mydata$rank)
mydata%>%rmarkdown::paged_table()

You can also obtain a summary of the dataset with:

summary(mydata)
##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000

To avoid any confusion regarding the levels of the response variable, let us rename them:

levels(mydata$admit)[levels(mydata$admit) %in% c('0', '1')] <-c('No', 'Yes')
table(mydata$admit) %>% as.data.frame() %>% setNames(c("Admit", "Counts")) %>%  kableExtra::kbl()
Admit Counts
No 273
Yes 127

For the sake of simplicity, we can define:

m0<-levels(mydata$admit)[1]
m1<-levels(mydata$admit)[2]

📝 Remarks.
  • The order of the levels in the response variable is significant.
  • In machine learning terminology, the class Y=1 is referred to as the positive class, while the class Y=0 is termed the negative class.
  • Typically, the reference level (the first level in the table) is assigned to the majority class, which is considered the negative class (here, 0=No). If necessary, you can change the reference level using the following command:
mydata$admit<-relevel(mydata$admit, ref='Yes')
table(mydata$admit)

Split the dataset (Train and Test) To evaluate the performance of our logistic regression model, we need to split the dataset into training and testing sets. This allows us to train the model on one subset of the data and validate its performance on another, unseen subset. To achieve a more representative split, particularly in terms of the outcome variable (admission), we will use stratified sampling. This ensures that the proportion of each class in the response variable is maintained in both the training and testing sets.

The code is the following:

#library(rsample) 
set.seed(0911)
#Create a stratified split of the data, maintaining the proportion of classes
mydata_split <- initial_split(mydata, prop = 0.7, strata = "admit")
#mydata_split <- initial_split(mydata, prop = .7)
train <- training(mydata_split)
test  <- testing(mydata_split)
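For readers without rsample, the same idea can be sketched in base R (a hand-rolled alternative, not the chapter's code; the helper stratified_split and the toy data frame are hypothetical):

```r
set.seed(0911)
stratified_split <- function(data, strata, prop = 0.7) {
  by_class <- split(seq_len(nrow(data)), data[[strata]])  # row indices per class
  idx <- unlist(lapply(by_class, function(i) sample(i, floor(prop * length(i)))))
  list(train = data[idx, ], test = data[-idx, ])
}
# toy data frame standing in for mydata (columns are hypothetical)
toy <- data.frame(admit = factor(rep(c("No", "Yes"), c(70, 30))), gre = seq_len(100))
sp <- stratified_split(toy, "admit")
prop.table(table(sp$train$admit))  # class proportions preserved (0.7 / 0.3)
```

Sampling within each class separately is exactly what makes the split stratified: both classes contribute 70% of their rows to the training set.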

2.2. Mathematical formalism

Let us define the mathematical framework for our analysis of student admissions.

  • Rank of Establishment. Let \(F\) denote the rank of the establishment of origin, represented by the factor rank, which can take on \(J=4\) levels (1, 2, 3 and 4). Notably, an establishment with rank 1 is considered more prestigious than one with rank 2.
  • Response Variable. The response variable \(Y_{ij}\) is defined as a binary factor (admit) with two modalities. Specifically, \(Y_{ij}=1\) corresponds to individual \(i\) from rank \(j\) being admitted (i.e., \(Y_{ij}=1=\)Yes), and \(Y_{ij}=0\) signifies no admission (i.e., \(Y_{ij}=0=\)No).
  • Covariates. To explain the admission process, we consider two covariates: \(X^{(1)}\) representing the Graduate Record Examination score (gre), and \(X^{(2)}\) denoting the Grade Point Average (gpa).

We can express the model in the following form: \[x_{ij}^T\beta=\mu+\alpha_j+\beta_1 X_{ij}^{(1)}+\beta_2 X_{ij}^{(2)},\quad\text{where}\quad Y_{ij}\vert x_{ij}\sim\mathcal{B}(p(x_{ij})) \] with the expected value given by \[\text{E}[Y_{ij}\vert x_{ij}]=p(x_{ij})\quad\text{and}\quad p(x_{ij})=\frac{e^{x_{ij}^T\beta}}{1+e^{x_{ij}^T\beta}}.\] It is important to note that \(p(x_{ij})=\mathbb P(Y=1)= \mathbb P(Y=Yes)\).

2.3. Simple logistic regression with the glm function

We will define three simple logistic regression models to explore the effects of the covariates on admission:

  • Model 1. This model considers only the covariate gre: \[\text{E}[Y_{i}\vert x_{i}]=p(x_{i})\quad\text{where}\quad x_{i}^T\beta=\mu+\beta_1 X_{i}^{(1)}.\]

  • Model 2. This model incorporates only the covariate gpa: \[\text{E}[Y_{i}\vert x_{i}]=p(x_{i})\quad\text{where}\quad x_{i}^T\beta=\mu+\beta_2 X_{i}^{(2)}.\]

  • Model 3. In this model, we examine the effect of the factor rank: \[\text{E}[Y_{ij}\vert x_{ij}]=p(x_{ij})\quad\text{where}\quad x_{ij}^T\beta=\mu+\alpha_j.\]

The models are implemented using the following R code:

model1 <- glm(admit ~ gre, family = "binomial", data = train)
model2 <- glm(admit ~ gpa, family = "binomial", data = train)
model3 <- glm(admit ~ rank, family = "binomial", data = train)

We can summarize the results of each model and present the coefficients using the following code:

model1%>% summary()%>% coefficients%>%knitr::kable()
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6751786 0.7217154 -3.706695 0.0002100
gre 0.0031948 0.0011790 2.709660 0.0067352
model2%>% summary()%>% coefficients%>%knitr::kable()
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.582399 1.2572904 -3.644663 0.0002677
gpa 1.110625 0.3617611 3.070052 0.0021402
model3%>% summary()%>% coefficients%>%knitr::kable()
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3022809 0.3198465 0.945081 0.3446175
rank2 -0.7783635 0.3766212 -2.066701 0.0387623
rank3 -1.7457336 0.4183470 -4.172932 0.0000301
rank4 -1.9398897 0.5224335 -3.713180 0.0002047

To visualize the impact of the covariates on the predicted probabilities of admission, we create plots for each model.

train2 <- train %>% mutate(prob = ifelse(admit == m1, 1, 0))
p1 <-train2 %>% ggplot( aes(gre, prob)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  xlab("gre") +ggtitle("Model 1")+ ylab("P(Y=1)")

p2 <-train2%>% ggplot( aes(gpa, prob)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +ggtitle("Model 2") +
  xlab("gpa")  +
  ylab("P(Y=1)")

train3 <- broom::augment(model3, train2, type.predict = "response") # .fitted on the probability scale
p3 <-train3 %>% ggplot( aes(rank, .fitted, color = rank)) +geom_boxplot() +
 geom_rug(sides = "b", position = "jitter", alpha = 0.2, show.legend = FALSE) +
  ggtitle("Model 3") +
  xlab("rank")  +
  ylab("P(Y=1)")

subplot(ggplotly(p1), ggplotly(p2), ggplotly(p3),nrows = 1)

Key code \(\,\)

  • dplyr::mutate(). This function is a powerful tool from the dplyr package that allows for the creation of new columns derived from existing variables within a dataset. In this context, it is employed to create a new column that quantifies the probability of admission, which is derived from the binary response variable.
  • broom::augment() The augment() function from the broom package plays a crucial role in enhancing the original dataset used for modeling. It appends additional columns, including predicted values, residuals, and other diagnostic metrics. This functionality allows for a more intuitive visualization and evaluation of the model’s performance, making it easier to assess how well the logistic regression model fits the observed data.

2.4. Multivariate logistic model

Now, let’s extend our analysis to a complete logistic regression model, represented as follows: \[ Y_{ij}\vert x_{ij}\sim\mathcal{B}(p(x_{ij}))\quad\text{where}\quad x_{ij}^T\beta=\mu+\alpha_j+\beta_1 X_{ij}^{(1)}+\beta_2 X_{ij}^{(2)}\quad\text{and}\quad p(x_{ij})=\frac{e^{x_{ij}^T\beta}}{1+e^{x_{ij}^T\beta}}.\]


Model Declaration. There are several approaches to declare a logistic regression model in R, and we will explore a few of them here.

Using the multinom() Function from the nnet Package. This method is particularly useful for multinomial logistic regression, allowing for the modeling of response variables with more than two categories.

model_multi <- multinom(admit ~ ., data = train)


Using the glm() Function. The Generalized Linear Model (glm) function provides a flexible framework for modeling various types of data, including binomial outcomes. Here, we specify the response variable admit in relation to the covariates rank, gpa, and gre.

model_glm <- glm(admit ~ rank+gpa+gre,family = "binomial",data = train )


Using the train() Function from the caret Package. The caret package is invaluable for building predictive models in R. By utilizing the train() function, we can efficiently create a logistic regression model while applying cross-validation to enhance model reliability.

model_caret <- train(admit ~ ., data = train, method = "glm",
                     family = "binomial", trControl = trainControl(method = "cv", number = 10))


3. Complete study of the summary of a multivariate logistic regression model

model_glm <- glm(
  admit ~ rank+gpa+gre,
  family = "binomial", 
  data = train
  )
summary(model_glm)
## 
## Call:
## glm(formula = admit ~ rank + gpa + gre, family = "binomial", 
##     data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.847985   1.396641  -2.755 0.005866 ** 
## rank2       -0.699451   0.385413  -1.815 0.069554 .  
## rank3       -1.707940   0.426848  -4.001  6.3e-05 ***
## rank4       -1.810166   0.531629  -3.405 0.000662 ***
## gpa          0.883593   0.409500   2.158 0.030948 *  
## gre          0.001780   0.001342   1.326 0.184769    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 347.84  on 278  degrees of freedom
## Residual deviance: 311.70  on 273  degrees of freedom
## AIC: 323.7
## 
## Number of Fisher Scoring iterations: 4

1. Number of Fisher Scoring iterations The summary output indicates the number of iterations required for convergence. Fisher’s scoring algorithm is a variant of Newton’s method specifically designed for solving maximum likelihood problems numerically.

In this model, it took four iterations to achieve convergence.

summary(model_glm)$iter
## [1] 4

2. Coefficients We assess the significance of each coefficient individually using a Gaussian test. As established in Theorem 1, the following asymptotic relationship holds: \[ I^{1/2}(\widehat{\beta}_n^{ML})\sqrt{n}( \widehat{\beta}_n^{ML}-\beta_0)\overset{\mathcal{D}}{\longrightarrow}\mathcal{N}(\boldsymbol{0}_p,\textbf{I}_p). \]

The coefficients and their associated statistics can be extracted as follows:

summary(model_glm)$coefficients
##                 Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -3.847984992 1.396641426 -2.755170 5.866159e-03
## rank2       -0.699451040 0.385413495 -1.814807 6.955360e-02
## rank3       -1.707940071 0.426847638 -4.001287 6.299879e-05
## rank4       -1.810165945 0.531628860 -3.404943 6.617787e-04
## gpa          0.883593421 0.409500429  2.157735 3.094844e-02
## gre          0.001779811 0.001342024  1.326214 1.847689e-01
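These z values and p-values can be recomputed by hand, since z = Estimate / Std. Error and Pr(>|z|) = \(2\,P(\mathcal{N}(0,1)\geq |z|)\). A quick check on the built-in mtcars data (used here so the snippet is reproducible without binary.csv):

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
tab <- summary(fit)$coefficients
z <- tab[, "Estimate"] / tab[, "Std. Error"]  # Gaussian test statistic
p <- 2 * pnorm(-abs(z))                       # two-sided p-value
cbind(z, p)                                   # matches the z value / Pr(>|z|) columns
```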

Moreover, we can perform the Wald test to further evaluate the null hypothesis regarding the coefficients.


[Proposition 4]
Consider the hypothesis test: \[ \mathcal{H}_0 \ : \ \beta_j=0, \ vs \ \mathcal{H}_1\ : \ \beta_j\neq 0. \] Under certain assumptions and under \(\mathcal{H}_0\) \[ S:=n\left[ I(\widehat{\beta}^{ML})\right]_{jj}\left(\widehat{\beta}^{ML}_j\right)^2\overset{\mathcal{D}}{\longrightarrow}\chi_{1}^2. \] For a fixed significance level \(\alpha \in (0, 1)\), the rejection region is defined as: \[\left\{S\geq q_{1-\alpha}^{\chi^2_{1}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{1}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with 1 degree of freedom.

#library(survey)
regTermTest(model_glm,"gre")
## Wald test for gre
##  in glm(formula = admit ~ rank + gpa + gre, family = "binomial", 
##     data = train)
## F =  1.758843  on  1  and  273  df: p= 0.18588
regTermTest(model_glm,"gpa")
## Wald test for gpa
##  in glm(formula = admit ~ rank + gpa + gre, family = "binomial", 
##     data = train)
## F =  4.65582  on  1  and  273  df: p= 0.031821

The Wald test can also be applied to categorical factors.


[Proposition 5]
Consider the hypothesis test: \[ \begin{cases} \mathcal{H}_0 & \alpha_{(-1)}&=(\alpha_2,\cdots,\alpha_J)^T=\boldsymbol{0}_{J-1},\\ \mathcal{H}_1 & \alpha_{(-1)}&\neq \boldsymbol{0}_{J-1}. \end{cases} \] Under certain assumptions and under \(\mathcal{H}_0\) \[S:=\left\|\sqrt n \,\text{I}\left(\widehat{\beta}_{(-1)}^{ML}\right)\widehat{\alpha}^{ML}_{(-1)}\right\| ^2\overset{\mathcal{D}}{\longrightarrow}\chi_{J-1}^2. \]

For a fixed significance level \(\alpha \in (0, 1)\), the rejection region is: \[\left\{S\geq q_{1-\alpha}^{\chi^2_{J-1}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{J-1}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(J - 1\) degrees of freedom.



📝 Remark.
    Note that for categorical variables, the Wald test has a different formulation due to the additional constraint (here \(\alpha_1 = 0\)).

Let us display the Wald test for the categorical variable rank.

regTermTest(model_glm,"rank")
## Wald test for rank
##  in glm(formula = admit ~ rank + gpa + gre, family = "binomial", 
##     data = train)
## F =  7.264522  on  3  and  273  df: p= 0.000105

3. Deviance To define deviance, we first introduce the concept of the saturated model:

  • Denote \([m_{sat}]\) as the saturated model, the model with one parameter per observation, so that \(\widehat{\text{E}[Y_i\vert x_i]} = y_i\) (this scenario typically leads to overfitting).
  • The saturated model \([m_{sat}]\) is the most complex model, and all other models can be considered as subsets, \([m] \subseteq [m_{sat}]\).
  • Thus, if a simpler (more parsimonious) model \([m]\) achieves a log of the likelihood \(\mathcal{L}_{[m]}\) close to \(\mathcal{L}_{[m_{sat}]}\), we prefer the simpler model.

Therefore, we compare \(\mathcal{L}_{[m]}\), the log of the likelihood of our model \([m]\), with \(\mathcal{L}_{[m_{sat}]}\), the log of the likelihood of the saturated model \([m_{sat}]\).

Example.

  • If \(Y_i\vert x_i\sim\mathcal{B}(p(x_i))\), then for the saturated model \([m_{sat}]\), we have: \[\widehat{\text{E}[Y_i\vert x_i]}=\widehat{p}(x_i)=y_i.\] In this case, the log of the likelihood is zero \[\mathcal{L}_{[m_{sat}]}=\sum_{i=1}^n \log\left(\widehat{p}(x_i)^{y_i} (1-\widehat{p}(x_i))^{1-y_i}\right) =0 \]

  • If \(Y_i\vert x_i\sim\mathcal{B}(n,p(x_i))\), then for the saturated model \([m_{sat}]\), we have: \[\widehat{\text{E}[Y_i\vert x_i]}=n\widehat{p}(x_i)=y_i.\]
    Here, the log of the likelihood is not zero: \[\mathcal{L}_{[m_{sat}]}=\sum_{i=1}^n \log\left(\binom{n}{y_i}\,\widehat{p}(x_i)^{y_i} (1-\widehat{p}(x_i))^{n-y_i}\right) \neq 0. \]


[Definition 3]
The deviance of a model \([m]\), defined with respect to the saturated model \([m_{sat}]\), is denoted \(\mathcal{D}_{[m]}\) and is equal to \[\mathcal{D}_{[m]}=2\left(\mathcal{L}_{[m_{sat}]}-\mathcal{L}_{[m]}\right)\geq0,\] where \(\mathcal{L}_{[m_{sat}]}\) and \(\mathcal{L}_{[m]}\) represent the log likelihoods in the saturated model and in the model \([m]\), respectively.

📝 Remark.
    It is evident that a higher deviance \(\mathcal{D}_{[m]}\) indicates a poorer fit for the model \([m]\).

The residual deviance corresponds to the value of the deviance of our model \([\texttt{mod}]\): \[\mathcal{D}_{[\texttt{mod}]}(\widehat{p})=2(\mathcal{L}_{[m_{sat}]}(\widehat{p})-\mathcal{L}_{[\texttt{mod}]}(\widehat{p}))=-2\mathcal{L}_{[\texttt{mod}]}(\widehat{p}),\] since \(\mathcal{L}_{[m_{sat}]}=0\) in the Bernoulli case.

model_glm$deviance
## [1] 311.7008

The null deviance represents the value of the deviance of the null model (which includes only the intercept): \[ \mathcal{D}_{[\texttt{nul}]}(\widehat{p})=-2\mathcal{L}_{[\texttt{nul}]}(\widehat{p})=-2\mathcal{L}_{[\texttt{nul}]}(\overline{y}) \]

model_glm$null.deviance
## [1] 347.8364

📈 The deviance of our model is significantly lower than the null deviance (the deviance of the model reduced to the intercept). This indicates that our model provides a better fit than the null model.
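Both quantities can be recomputed directly: for a 0/1 (Bernoulli) response the saturated log-likelihood is zero, so the residual deviance equals \(-2\mathcal{L}\), and the null deviance is the deviance of the intercept-only model. A check on the built-in mtcars data (used for reproducibility; the chapter's train set is not re-fit here):

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
fit$deviance                  # residual deviance
-2 * as.numeric(logLik(fit))  # equals the residual deviance
fit0 <- update(fit, . ~ 1)    # null (intercept-only) model
fit0$deviance                 # equals fit$null.deviance
```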

4. Diagnostic plots and Deviance residuals

4.1. Types of residuals

In a linear regression setting, residuals are defined as the difference between the observed values \(y_i\) and the predicted values \(\widehat{\text{E}[Y_i\vert x_i]}=\widehat y_i\). However, in a more general context, it is possible that \(\widehat{\text{E}[Y_i\vert x_i]}\neq \widehat y_i\).

😕 In logistic regression, the outcome data is discrete, and consequently, the residuals are also discrete. Therefore, plots of raw residuals from logistic regression are generally not very informative.


1. Response residuals In logistic regression, the residuals are defined as the difference between the observed values \(y_i\) and the predicted probabilities \(\widehat{p}(x_{i})=g^{-1}(x_{i}^T\widehat\beta)\): \[ res_{response_i}:=\widehat\epsilon_{i}=y_{i}-\widehat{p}_{i}. \]

To display the response residuals, you can use the following command:

# response residuals=y_i-hat p_i
model_glm%>% residuals(type='response') %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7175 -0.2999 -0.1586  0.0000  0.4522  0.9018

2. Pearson residuals The Pearson residuals are obtained by normalizing the response residuals \(res_{response_i}\) by the square root of \(\widehat{\text{Var}(Y_{i})}\), the estimated variance of \(Y_{i}\). In a logistic context, this variance is given by: \[\widehat{\text{Var}(Y_{i})}=\widehat{p}(x_{i})( 1-\widehat{p}(x_{i})).\] The Pearson residuals are then defined as: \[ res_{pearson_i} =\frac{res_{response_i}}{\sqrt{\widehat{p}(x_{i})( 1-\widehat{p}(x_{i}))}}. \]

To display Pearson residuals, use the following command:

model_glm%>% residuals(type='pearson') %>% summary()
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.593811 -0.654480 -0.434174 -0.001998  0.908498  3.030883
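The defining formula can be checked against R's residuals() (built-in mtcars data used so the snippet is self-contained):

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
p_hat <- fitted(fit)
r_pearson <- (mtcars$am - p_hat) / sqrt(p_hat * (1 - p_hat))  # (y - p) / sd
all.equal(r_pearson, residuals(fit, type = "pearson"))        # TRUE
```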

3. Standardized Pearson residuals The Standardized Pearson residuals \(res_{Stand.pearson_i}\) are calculated by normalizing the Pearson residuals \(res_{pearson_i}\) with the leverage effect: \[ res_{Stand.pearson_i} =\frac{res_{pearson_i}}{\sqrt{1-h_{ii}}}, \] where \(h_{ii}\) is the \(i^{th}\) diagonal element of the projection matrix \(H=X(X^TX)^{-1}X^T\) in the full rank setting of the matrix \(X\).

To display these residuals, use:

model_glm%>% rstandard(type='pearson') %>% summary()
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.618585 -0.661913 -0.437873 -0.001867  0.921366  3.056838
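Likewise, rstandard() reproduces the leverage-corrected formula, with \(h_{ii}\) obtained from hatvalues() (again a sketch on the built-in mtcars data):

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
r_std <- residuals(fit, type = "pearson") / sqrt(1 - hatvalues(fit))
all.equal(r_std, rstandard(fit, type = "pearson"))  # TRUE (dispersion is 1 here)
```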

4. Deviance residuals The summary of model_glm includes the deviance residuals, which are suitable for generalized models. These residuals measure how far the log-likelihood contribution \(\mathcal{L}_{[m]}(\widehat\beta,y)\) of the \(i\)-th observation is from its saturated counterpart \(\mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)\). To define them:

  • The deviance is calculated as: \[\mathcal{D}_{[m]}=2\left( \mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)-\mathcal{L}_{[m]}(\widehat\beta,y) \right) \] where \(\widehat\beta\) and \(\widehat\beta_{sat}\) are the maximum likelihood estimators obtained from the models \([m]\) and \([m_ {sat}]\), respectively.

  • For any model \([m]\), the log of the likelihood \(\mathcal{L}_{[m]}\) can be expressed as a sum of contributions from each data point: \[ \mathcal{L}_{[m]}(\beta,y)=\sum_{i=1}^n \log\left( p(x_i)^{y_i} (1- p(x_i))^{1-y_i}\right):=\sum_{i=1}^n \mathcal{L}_{[m]}(\beta,y)|_i \]

Thus, the deviance can also be expressed as a sum of contributions from each observation: \[ \mathcal{D}_{[m]}:=\sum_{i=1}^n d_i \] where \(d_i\) represents the deviance due to observation \(i\). Using this formulation, the deviance residual for observation \(i\) is defined as: \[ res_{dev_i} ={\rm sign}(y_i-\widehat{p}_i)\sqrt{ d_i}. \] To display the deviance residuals, use the following command:

#by default :  model_glm%>% residuals() %>% summary()
model_glm%>% residuals(type = "deviance") %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5901 -0.8444 -0.5877 -0.1047  1.0971  2.1545
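Two properties of the deviance residuals are easy to verify numerically: their squares sum to the residual deviance, and their signs follow \({\rm sign}(y_i-\widehat{p}_i)\). A sketch on the built-in mtcars data:

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
r_dev <- residuals(fit, type = "deviance")
all.equal(sum(r_dev^2), fit$deviance)              # D = sum of the d_i
all(sign(r_dev) == sign(mtcars$am - fitted(fit)))  # sign convention
```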

5. Standardized deviance residual

Standardized deviance residuals measure how far \(\mathcal{L}_{[m]}(\widehat\beta,y)\) for the \(i\)-th observation is from \(\mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)\) for the same observation, but they are normalized through the leverage effect: \[ res_{stand\_dev_i} =\frac{res_{dev_i}}{\sqrt{1-h_{ii}}}. \]

To compute these residuals, use:

model_glm%>% rstandard(type='deviance') %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6148 -0.8537 -0.5927 -0.1055  1.1121  2.1730
📝 Remark.
    If the model is “good,” we can show that the residuals are asymptotically Gaussian (this can be verified using a Q-Q plot).


[Which Residuals Should You Focus On?]

As is often the case in statistics, the answer depends on the context!

  • Response residuals are useful if you want to explain a specific data point to someone who is not statistically trained, as they are on a more intuitive scale.
  • Deviance residuals are appropriate when you want to be consistent with the underlying mathematics, as they are based on likelihood, which is also the basis for fitting models in generalized linear models (GLMs).
  • Pearson residuals (and other standardized residuals) can help identify if a particular point is truly unusual since they are scaled.


4.2. Usual diagnostic plots

1. Residuals vs. Fitted plot

The Residuals vs. Fitted plot is typically used to check for trends in the residuals. This plot utilizes Pearson residuals.

autoplot(model_glm,1)

😕 In logistic regression, the discrete nature of the residuals means that plots of raw residuals are generally not informative.


2. Scale-Location plot The Scale-Location plot is commonly employed to detect heteroskedasticity. Standardized Pearson residuals \(res_{Stand.pearson_i}\) are used here.

autoplot(model_glm,3)

😕 Due to the nature of the response variable \(Y\), the residuals are often heteroskedastic, rendering this plot less useful.


3. Normal Q-Q plot In a linear setting, the Q-Q plot is used to assess the normality of the residuals.

autoplot(model_glm,2)

😕 Although standardized Pearson residuals have an approximate standard normal distribution (as noted in Agresti, 2002), interpreting the plot can be challenging. Therefore, this Normal Q-Q plot is not particularly useful.

4.3. New diagnostic plots

1. Normal Q-Q plot We can generate a Normal Q-Q plot based on either Standardized Pearson residuals or Standardized Deviance residuals.

residual_glm<-rstandard(model_glm,type = 'deviance')
#residual_glm<-rstandard(model_glm,type = 'pearson')


ggplot(train,aes(sample = residual_glm,col=admit)) +geom_qq(size=1)+geom_qq_line(col='black')+ facet_wrap(~admit)+ 
  ylab("Standardized deviance residuals Quantiles") + xlab("Theoretical Quantiles") +
  ggtitle("Normal Q-Q plot") 


📝 Remarks.
  • The function geom_qq(distribution = stats::qnorm) defaults to a normal distribution.
  • The geom_qq_line() function adds a line to the theoretical Q-Q plot that passes through the first and third quartiles.

📈 The quantiles of our sampled data closely align with the theoretical quantiles, indicating that our residuals are approximately normally distributed.


2. An alternative to Residuals vs Fitted plot It is essential to ensure that there is no underlying structure or trend in the residuals. If a trend is detected, it may be necessary to investigate the cause, such as a poor model fit or a nonlinear relationship.

The Binned Residuals Plot serves as an alternative to the Residuals vs. Fitted plot. It divides the data into categories (bins) based on their fitted values, plotting the average residual against the average fitted value for each bin.

  • First, it groups points with similar predicted probabilities into bins.
  • After segmenting the data into \(J\) bins based on fitted values, it plots, for each bin, the average residual (two equivalent definitions are given below) against the bin's average fitted value.

Expected residuals. The difference between the proportion of Yes responses observed among the points in a bin and the average predicted probability for those points. \[ res_{Expected_j}:=\frac{\sum_{bin_j}y_i}{\#\,bin_j}-\frac{\sum_{bin_j}\widehat{p}(x_i)}{\#\,bin_j} \] Example. For a group of 20 points with 11 Yes responses and an average prediction of 0.6, the expected residual would be \(11/20 - 0.6 = -0.05\).

Average residuals. \[ res_{Average_j}:=\frac{\sum_{bin_j}(y_i-\widehat{p}(x_i))}{\#\,bin_j}=\frac{\sum_{i\in bin_j}res_{response_i}}{\#\,bin_j} \] Note that the two definitions coincide: the mean of the differences equals the difference of the means.
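The equivalence of the two definitions can be checked numerically. A minimal sketch using the toy values from the example above (11 Yes responses out of 20 points, every predicted probability set to 0.6):

```r
# Toy bin: 11 Yes out of 20, with a constant predicted probability of 0.6
y    <- c(rep(1, 11), rep(0, 9))
phat <- rep(0.6, 20)

res_expected <- mean(y) - mean(phat)  # 11/20 - 0.6 = -0.05
res_average  <- mean(y - phat)        # same value, by linearity of the mean
```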

The binnedplot function from the arm package is used to create this binned residual plot.

binnedplot(fitted(model_glm), residuals(model_glm, type = "response"), col.int = "blue", col.pts = 2)

📈 The graph produced by binnedplot also displays a 95% prediction interval (indicated by blue lines) for the mean residuals. If the binomial model is adequate, approximately 95% of the residuals should fall within this interval, which is indeed the case here.


📝 Remarks.
  • By default, binnedplot selects the number of groups based on a balance between having enough points per bin (to ensure accurate average proportions) and enough bins (to observe trends if they exist). For \(n>100\), the number of groups chosen is approximately \(\sqrt n\).
  • The binnedplot function can also be used to investigate residuals in relation to individual predictors (see Gelman, page 98, for an example).

5. Odds ratio (OR)

In logistic regression, the coefficients can be interpreted as Odds Ratios (OR), which provide insight into the relationship between predictor variables and the probability of a particular outcome.

5.1. Definitions


[Definition 4]
The odds for an individual \(x\) to obtain the answer \(Y=1\) are defined by
\[odds(x)=\frac{p(x)}{1-p(x)},\quad\text{where}\quad p(x)=P(Y=1\vert x).\]


📝 Remark.
    Odds and probabilities represent the same underlying information but in different forms: \[ odds=\frac{p}{1-p}\quad \text{and}\quad p=\frac{odds}{1+odds}. \]

Example. An odds of 1 is equivalent to a probability of \(p=0.5\).
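These conversions are one-liners in R; a small sketch (the helper names below are ours, not from any package):

```r
# Convert between probability and odds
odds_from_p <- function(p) p / (1 - p)
p_from_odds <- function(o) o / (1 + o)

odds_from_p(0.5)  # an odds of 1 corresponds to p = 0.5
p_from_odds(1)    # and back again: 0.5
```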


The ratio of two odds is termed the Odds Ratio (OR) and is defined as follows.


[Definition 5]
The Odds ratio (OR) between two individuals \(x\) and \(x'\) is defined as: \[OR(x,x')=\frac{odds(x)}{odds(x')}=\frac{p(x)(1-p(x'))}{p(x')(1-p(x))}.\]


Use of OR. Odds Ratios allow for the comparison of probabilities of “success” between individuals \(x\) and \(x'\):

  • \(OR(x,x')>1\) \(\Leftrightarrow\) \(p(x)>p(x')\),
  • \(OR(x,x')=1\) \(\Leftrightarrow\) \(p(x)=p(x')\),
  • \(OR(x,x')<1\) \(\Leftrightarrow\) \(p(x)<p(x')\).

Other interpretation: change in probability. An Odds Ratio can also be interpreted in terms of a change in probability.

Example. A change in probability from \(p_1=1/3\approx 0.33\) to \(p_2=0.5\) results in an OR of 2: \[ \frac{ \frac{0.5}{0.5}}{ \frac{1/3}{2/3}}=2. \] This indicates that the odds of success have doubled.


Other interpretation: percentage Increase in Odds. We can also interpret the OR as reflecting the percentage increase in the odds of an event. Specifically, an OR of 2 implies that the odds have increased by 100%.

Example. For \(p_1 = 0.33\) and \(p_2 = 0.5\), the initial and final odds are: \[ odds_1 = \frac{0.33}{1 - 0.33} = \frac{0.33}{0.67} \approx 0.49\quad\text{and}\quad odds_2 = \frac{0.5}{1 - 0.5} = \frac{0.5}{0.5} = 1. \]

The increase in odds from approximately 0.49 to 1 represents a change of: \[ \text{Percentage Increase} = \left( \frac{odds_2 - odds_1}{odds_1} \right) \times 100\% = \left( \frac{1 - 0.49}{0.49} \right) \times 100\% \approx 104.08\%. \]

This shows that moving from an odds of approximately 0.49 to 1 corresponds to an increase of over 104%, which aligns with the interpretation of the Odds Ratio as an indicator of relative change.
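The arithmetic above can be reproduced directly; note that the exact values differ slightly from the text, which rounds \(odds_1\) to 0.49 before computing the percentage:

```r
p1 <- 0.33; p2 <- 0.5
odds1 <- p1 / (1 - p1)   # ~ 0.4925
odds2 <- p2 / (1 - p2)   # = 1
OR <- odds2 / odds1      # ~ 2.03

# ~ 103% with exact odds1; ~ 104% when odds1 is rounded to 0.49 as in the text
pct_increase <- (odds2 - odds1) / odds1 * 100
```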


Other interpretation: relative risk. The OR can be related to relative risk. For probabilities \(p(x)\) and \(p(x')\) that are very small compared to 1, the OR can be approximated as \(OR(x,x') \approx p(x)/p(x'),\) which provides a relative interpretation between \(x\) and \(x'\).

Example. In the context of a rare disease (where \(Y=1\) indicates the presence of the disease), if \[OR(x,x')\approx p(x)/p(x')=4,\] it indicates that the response (illness) is four times more likely for individual \(x\) than for individual \(x'\).


Other interpretation: impact of an explanatory variable. In a logistic regression model, the log-odds of the probability can be expressed as: \[ \begin{array}{l} logit(p(x))=\beta_0+\beta_1x_1+\ldots+\beta_px_p,\\ logit(p(x'))=\beta_0+\beta_1x'_1+\ldots+\beta_px'_p. \end{array} \] Consequently, the Odds Ratio can be expressed as: \[ OR(x,x')=\exp\left( \beta_1(x_1-x'_1)+\ldots+\beta_p(x_p-x'_p) \right). \] To examine the influence of an explanatory variable \(x_j\) on the Odds Ratio, we consider two observations \(x\) and \(x'\) that differ only in the \(j\)-th variable (i.e., \(x_i=x'_i\) for all \(i\neq j\) and \(x_j\neq x'_j\)). The Odds Ratio simplifies to: \[ OR(x,x')=\exp\left( \beta_j(x_j-x'_j) \right). \] If \(x_j-x'_j=1\), then \(OR(x,x')=\exp(\beta_j),\) which allows the influence of variable \(j\) on the response to be studied without dependence on the values of \(x\). These values can be read directly off the fitted coefficients.

Example. Consider two predictors \(X^{(1)}\) and \(X^{(2)}\) such that \[\begin{array}{ll} \beta_1=0.5&\Rightarrow\text{e}^{\beta_1}=\text{e}^{0.5}\approx 1.65\\ \beta_2=-0.15&\Rightarrow\text{e}^{\beta_2}=\text{e}^{-0.15}\approx 0.86 \end{array} \]

These exponentiated coefficients can be interpreted as the percentage or multiplicative change in the odds associated with a one-unit increase/decrease in the predictor (while holding the other predictors constant): \[\begin{array}{lll} \text{Predictor 1:}&\text{from 1 to 1.65 }&\Rightarrow\text{ a 65\% increase}\\ \text{Predictor 2:}&\text{from 1 to 0.86 }&\Rightarrow\text{ a 14\% decrease} \end{array} \]
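The same numbers, computed directly:

```r
beta1 <- 0.5; beta2 <- -0.15
exp(beta1)  # ~ 1.65: odds multiplied by 1.65, i.e. a 65% increase
exp(beta2)  # ~ 0.86: odds multiplied by 0.86, i.e. a 14% decrease
```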


Interpretation. For a given explanatory variable:

  • \(OR=1\) indicates “no effect”.
  • \(OR>1\) suggests “an increase” in the phenomenon being studied (e.g. Yes).
  • \(OR<1\) indicates “a decrease” in the phenomenon being studied.


5.2. OR with R

In this section, we calculate the Odds Ratios (OR) for our logistic regression model, including their confidence intervals and associated p-values. The null hypothesis (\(\mathcal{H}_0\)) posits that there is no effect, while the alternative hypothesis (\(\mathcal{H}_1\)) suggests the presence of an effect.


Displaying Odds Ratios and Confidence Intervals To obtain the Odds Ratios along with their confidence intervals and p-values, we can use the odds.ratio function from the questionr package. This will provide a clear summary of how each predictor variable affects the outcome.

library(questionr)
odds.ratio(model_glm)%>%  kableExtra::kbl() %>% kableExtra::kable_styling()
              OR          2.5 %       97.5 %      p
(Intercept)   0.0213227   0.0012976   0.3150431   0.0058662
rank2         0.4968580   0.2308283   1.0529672   0.0695536
rank3         0.1812387   0.0768620   0.4125094   0.0000630
rank4         0.1636270   0.0544679   0.4460768   0.0006618
gpa           2.4195787   1.0950123   5.4800698   0.0309484
gre           1.0017814   0.9991697   1.0044581   0.1847689
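For reference, the same quantities can be obtained in base R by exponentiating the coefficients and their profile-likelihood confidence intervals. A self-contained sketch on simulated data (the variables below are synthetic, not the admission data):

```r
# Base-R equivalent of questionr::odds.ratio() on a simulated logistic model
set.seed(1)
d <- data.frame(gpa = rnorm(200, 3, 0.4), gre = rnorm(200, 580, 100))
d$admit <- rbinom(200, 1, plogis(-8 + 2 * d$gpa + 0.003 * d$gre))
m <- glm(admit ~ gpa + gre, family = "binomial", data = d)

exp(cbind(OR = coef(m), confint(m)))  # odds ratios with 95% profile CIs
```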

📈 Interpretation

  • Intercept. An OR of 0.0213227 indicates that when all predictors are at their reference levels, the odds of the outcome occurring are very low.

  • rank2. An OR of 0.4968580 suggests that moving from the reference rank to rank 2 reduces the odds of the outcome by approximately 50.31%, but this result is only marginally significant (\(p = 0.0696\)).

  • rank3. An OR of 0.1812387 indicates that moving to rank 3 decreases the odds of the outcome substantially (by about 81.88%) compared to the reference rank. This is highly significant (\(p \approx 6.3\times 10^{-5}\)).

  • rank4. Similarly, an OR of 0.1636270 indicates that moving to rank 4 decreases the odds of the outcome by approximately 83.64%. This result is also highly significant (\(p \approx 6.6\times 10^{-4}\)).

  • gpa. An OR of 2.4195787 suggests that for each unit increase in gpa, the odds of the outcome occurring increase by about 142%. This effect is statistically significant (\(p = 0.0309\)).

  • gre. An OR of 1.0017814 indicates a negligible increase in odds for each unit increase in gre score, and the p-value (\(p = 0.1848\)) suggests that this effect is not statistically significant.

  • The 2.5% and 97.5% confidence intervals provide a range of values for the OR. For instance, the OR for rank 3 ranges from 0.0768620 to 0.4125094, which does not include 1, indicating a significant effect. Conversely, for gre, the confidence interval includes 1 (0.9991697 to 1.0044581), suggesting that it may not have a significant impact.


Visualizing Odds Ratios with a Forest Plot A forest plot is an effective way to visually represent the Odds Ratios, their confidence intervals, and significance levels. We can use the forest_model function from the forestmodel package to generate this plot.

#library(forestmodel)
forest_model(model_glm)

6. Variables selection

6.1. Test of Type I and II

We start with a Type I test, which evaluates the significance of each predictor in the model sequentially. The anova() function output shows the deviance reductions for each predictor added sequentially.

Here’s the summary:

anova(model_glm,test="Chisq")

📈 These results suggest that both rank and gpa add value to the model, whereas gre may not be a necessary predictor.


Next, we perform Type II tests using the drop1 function. This function removes each variable one at a time and evaluates the change in deviance to determine if the model fit significantly worsens without that variable. This approach is useful to check if each variable contributes significantly to the model.

    ⚠️This method is not compatible with multinom models.

Here’s the drop1() output:

drop1(model_glm, test = "Chisq")

📈 Both Type I and Type II tests highlight rank and gpa as significant predictors, while gre appears unnecessary.

6.2. Compare models

Based on the results above, we’ll explore different model configurations to identify the best-fitting model for predicting admit.

Here are the models:

model_glm<-glm(admit~rank+gpa+gre,family='binomial',data=train)
modF2<-glm(admit~rank+gre,family='binomial',data=train)
modF<-glm(admit~rank+gpa,family='binomial',data=train)
modNull<-glm(admit~1,family='binomial',data=train)

To compare these models, we can use the LRstats() function from the vcdExtra package, which provides an asymptotic Goodness-of-Fit test by deviance to check if a model \([mod]\) of size \(m\) is adequate to explain the data.

\[ \begin{cases} \mathcal{H}_0 :& \mathcal{D}_{[mod]}=0&\text{model is adequate}\\ \mathcal{H}_1 :& \mathcal{D}_{[mod]}>0&\text{model is inadequate}\\ \end{cases} \]


📝 Remarks.
  • This test assesses if the model differs significantly from a saturated model.
  • Under \(\mathcal{H}_0\), the test implies that there is no significant difference between the models.


[Proposition 6]
Under certain conditions and under \(\mathcal{H}_0\) \[\mathcal{D}_{[mod]}\overset{\mathcal{D}}{\longrightarrow} \chi^2_{n-m}, \] where \(n−m\) represents the degrees of freedom. For a given significance level \(\alpha\in(0,1)\), the rejection region is: \[\left\{\mathcal{D}_{[mod]}\geq q_{1-\alpha}^{ \chi^2_{n-m}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{n-m}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(n-m\) degrees of freedom.

Here’s the LRstats() output:

#library(vcdExtra)
LRstats(model_glm,modF,modF2,modNull)

📈 Interpretation

  • AIC and BIC. The AIC and BIC values are lowest for modF and model_glm, suggesting they are the best-fitting models among those considered.
  • For the likelihood ratio tests. The full model (model_glm) is nearly significant (\(p\approx 0.05\)), suggesting it fits reasonably well but may include unnecessary complexity.
  • modF is the preferred model, as it balances model fit with simplicity, making it both interpretable and efficient for prediction.

6.3. Comparing Nested Models

To delve deeper, we compare nested models using a Likelihood Ratio (LR) test. This test evaluates whether adding parameters (moving from a simpler model to a more complex model) significantly improves fit.


[Proposition 7]

Consider two nested models \([m_0]\) and \([m_1]\), where (\([m_0]\subset [m_1]\)). The hypotheses are: \[ \begin{cases} \mathcal{H}_0 :& [m_0] &\text{ is adequate}\\ \mathcal{H}_1 :& [m_1] &\text{ is adequate}\\ \end{cases} \]

Under \(\mathcal{H}_0\), the LR test statistic is defined as: \[\text{LRtest}:=2(\mathcal{L}_{[m_1]}-\mathcal{L}_{[m_0]})=(\mathcal{D}_{[m_0]}-\mathcal{D}_{[m_1]})\overset{\mathcal{D}}{\longrightarrow} \chi^2_{m_1-m_0},\] where \(m_1-m_0\) represents the degrees of freedom. For a given significance level \(\alpha\in(0,1)\), the rejection region is: \[\left\{\text{LRtest}\geq q_{1-\alpha}^{ \chi^2_{m_1-m_0}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{m_1-m_0}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(m_1-m_0\) degrees of freedom.
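The statistic in Proposition 7 is easy to compute by hand and to check against the packaged test; a sketch on simulated data, where x2 is pure noise:

```r
# Hand-rolled LR test between two nested logistic models
set.seed(1)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
d$y <- rbinom(150, 1, plogis(0.7 * d$x1))
m0 <- glm(y ~ x1,      family = "binomial", data = d)  # [m0]
m1 <- glm(y ~ x1 + x2, family = "binomial", data = d)  # [m1], one extra parameter

LR   <- as.numeric(2 * (logLik(m1) - logLik(m0)))  # = D_[m0] - D_[m1]
pval <- pchisq(LR, df = 1, lower.tail = FALSE)     # compare with lmtest::lrtest(m0, m1)
```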

To perform that test, we can use the lrtest function from the lmtest package.

#library(lmtest)
lrtest(modF,model_glm)

📈 The result confirms that modF is preferable to model_glm.


Testing for Interactions. Lastly, we consider whether interaction terms improve the model fit. We introduce an interaction between gpa and rank:

modF_int<-glm(admit~gpa*rank,family='binomial',data=train)
lrtest(modF, modF_int)

📈 The test indicates that adding the interaction term does not improve the model, so we maintain modF as the preferred model.

7. Visualizing Variable Effects

Graphical representation of each predictor’s effect on the target variable can provide valuable insights into model behavior, especially in understanding how individual variables influence predictions. The allEffects function from the effects package allows us to plot the estimated effects of each predictor in the logistic model, helping us visualize trends and relationships.

    ⚠️The effects package works for glm models but is not compatible with multinom models.
#library(effects)
plot(allEffects(modF))

📈 Interpretation

  • Effect of gpa on Admission. The plot shows the relationship between gpa and the likelihood of admission. As gpa increases, the likelihood of admission also increases. Students with a higher gpa have a greater probability of being admitted. This aligns with the model’s intuition that academic performance (reflected in gpa) positively influences admission outcomes.
  • Effect of rank on Admission. The plot for rank illustrates the probability of admission across different ranks of the applicant’s undergraduate institution. As rank improves (lower rank number, indicating a more prestigious institution), the likelihood of admission increases significantly. Applicants from higher-ranked schools (lower numerical rank value) have a better chance of admission. The model attributes significant weight to the rank variable, suggesting it’s a crucial factor in the admission decision.
  • By viewing these plots side-by-side, we can directly observe and compare how gpa and rank each contribute to the predicted probability of admission. Such comparisons are essential in assessing variable importance and informing strategies for improving admission odds. These plots provide actionable insights. For example, applicants might prioritize institutions with a higher rank or aim to improve their gpa to enhance their chances.
  • If there are interaction effects between variables, they may not be fully captured in these individual effect plots. Interaction terms, if present in the model, should also be visualized separately to understand their combined influence on admission probability.